Smart Paper Notes

Identifying Machine-Paraphrased Plagiarism

Jan Philip Wahle , Terry Ruas , Tomáš Foltýnek , Norman Meuschke , Bela Gipp

Categories: Natural Language Processing | Artificial Intelligence

2021-03-22

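Employing paraphrasing tools to conceal plagiarized text is a severe threat to academic integrity. To enable the detection of machine-paraphrased text, we evaluate the effectiveness of five pre-trained word embedding models combined with machine learning classifiers and state-of-the-art neural language models. We analyze preprints of research papers, graduation theses, and Wikipedia articles, which we paraphrased using different configurations of the tools SpinBot and SpinnerChief. The best-performing technique, Longformer, achieved an average F1 score of 80.99% (F1 = 99.68% for SpinBot and F1 = 71.64% for SpinnerChief cases), while human evaluators achieved F1 = 78.4% for SpinBot and F1 = 65.6% for SpinnerChief cases. We show that automated classification alleviates shortcomings of widely used text-matching systems, such as Turnitin and PlagScan. To facilitate future research, all data, code, and two web applications showcasing our contributions are openly available.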

Translated by Google Translate

Independent Components of Word Embeddings Represent Semantic Features

Tomáš Musil , David Mareček

Categories: Natural Language Processing

2022-12-19

Independent Component Analysis (ICA) is an algorithm originally developed for finding separate sources in a mixed signal, such as a recording of multiple people in the same room speaking at the same time. It has also been used to find linguistic features in distributional representations. In this paper, we used ICA to analyze word embeddings. We have found that ICA can be used to find semantic features of the words and these features can easily be combined to search for words that satisfy the combination. We show that only some of the independent components represent such features, but those that do are stable with regard to random initialization of the algorithm.

Flowstorm: Open-Source Platform with Hybrid Dialogue Architecture

Jan Pichl , Petr Marek , Jakub Konrád , Petr Lorenc , Ondřej Kobza , Tomáš Zajíček , Jan Šedivý

Categories: Artificial Intelligence

2022-12-19

This paper presents a conversational AI platform called Flowstorm. Flowstorm is an open-source SaaS project suitable for creating, running, and analyzing conversational applications. Thanks to the fast and fully automated build process, the dialogues created within the platform can be executed in seconds. Furthermore, we propose a novel dialogue architecture that uses a combination of tree structures with generative models. The tree structures are also used for training NLU models suitable for specific dialogue scenarios. The generative models, by contrast, are used globally across applications and extend the functionality of the dialogue trees. Moreover, the platform functionality benefits from out-of-the-box components, such as the one responsible for extracting data from utterances or working with crawled data. Additionally, it can be extended using custom code directly in the platform. One of the essential features of the platform is the possibility to reuse the created assets across applications. There is a library of prepared assets to which each developer can contribute. All of the features are available through a user-friendly visual editor.

Using Set Covering to Generate Databases for Holistic Steganalysis

Rony Abecidan , Vincent Itier , Jérémie Boulanger , Patrick Bas , Tomáš Pevný

Categories: Computer Vision

2022-11-07

Within an operational framework, covers used by a steganographer are likely to come from different sensors and different processing pipelines than the ones used by researchers for training their steganalysis models. Thus, a performance gap is unavoidable when it comes to out-of-distribution covers, an extremely frequent scenario called Cover Source Mismatch (CSM). Here, we explore a grid of processing pipelines to study the origins of CSM, to better understand it, and to better tackle it. A set-covering greedy algorithm is used to select representative pipelines minimizing the maximum regret between the representative and the pipelines within the set. Our main contribution is a methodology for generating relevant bases able to tackle operational CSM. Experimental validation highlights that, for a given number of training samples, our set covering selection is a better strategy than selecting random pipelines or using all the available pipelines. Our analysis also shows that parameters such as denoising, sharpening, and downsampling are very important to foster diversity. Finally, different benchmarks for classical and wild databases show the good generalization property of the extracted databases. Additional resources are available at github.com/RonyAbecidan/HolisticSteganalysisWithSetCovering.
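The greedy set-covering selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regret matrix, the tolerance `eps`, and the rule "pipeline i covers pipeline j when the regret of training on i and testing on j is at most `eps`" are all assumptions made for illustration, and the numbers are toy values.

```python
def greedy_cover(regret, eps):
    """Pick representative pipelines with the classic greedy set-cover rule.

    regret[i][j] is a hypothetical regret of training on pipeline i and
    testing on pipeline j (assumed 0 on the diagonal). Pipeline i "covers"
    pipeline j when regret[i][j] <= eps.
    """
    n = len(regret)
    covers = [{j for j in range(n) if regret[i][j] <= eps} for i in range(n)]
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        # Take the pipeline that covers the most still-uncovered pipelines.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            raise ValueError("some pipelines cannot be covered at this eps")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy regret matrix for four hypothetical pipelines (made-up values).
toy_regret = [
    [0.00, 0.02, 0.30, 0.25],
    [0.03, 0.00, 0.28, 0.04],
    [0.31, 0.27, 0.00, 0.05],
    [0.22, 0.26, 0.04, 0.00],
]
representatives = greedy_cover(toy_regret, eps=0.05)
```

With the toy matrix, pipeline 1 covers pipelines 0, 1, and 3 within the tolerance, so it is picked first, and one more pipeline is needed for pipeline 2. Raising `eps` trades a smaller set of representatives for a larger worst-case regret.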

AARGH! End-to-end Retrieval-Generation for Task-Oriented Dialog

Tomáš Nekvinda , Ondřej Dušek

Categories: Natural Language Processing

2022-09-08

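We introduce AARGH, an end-to-end task-oriented dialog system that combines retrieval and generative approaches in a single model, aiming to improve dialog management and the lexical diversity of outputs. The model features a new response selection method based on an action-aware training objective and a simplified single-encoder retrieval architecture, which allow us to build an end-to-end retrieval-enhanced generation model in which retrieval and generation share most of the parameters. On the MultiWOZ dataset, we show that our approach produces more diverse outputs than state-of-the-art baselines while maintaining or improving state tracking and context-to-response generation performance.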

Denoising Architecture for Unsupervised Anomaly Detection in Time-Series

Wadie Skaf , Tomáš Horváth

Categories: Machine Learning | Artificial Intelligence

2022-08-30

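Anomalies in time series provide insights into critical scenarios across a range of industries, from banking and aerospace to information technology, security, and medicine. However, identifying anomalies in time-series data is particularly challenging because of the imprecise definition of anomalies, the frequent absence of labels, and the enormously complex temporal correlations present in such data. The LSTM Autoencoder is an encoder-decoder scheme for anomaly detection based on Long Short-Term Memory networks that learns to reconstruct time-series behavior and then uses the reconstruction error to identify anomalies. We introduce the Denoising Architecture as a complement to this LSTM encoder-decoder model and investigate its effect on real-world as well as artificially generated datasets. We demonstrate that the proposed architecture increases both accuracy and training speed, thereby making the LSTM Autoencoder more efficient for unsupervised anomaly detection tasks.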
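The idea of flagging anomalies from reconstruction error can be sketched without a neural network. The example below is only an illustration under stated assumptions: a centered moving average stands in for the LSTM autoencoder's reconstruction (a deliberately swapped-in, much simpler technique), and the "mean plus k standard deviations" threshold is an assumed rule that the abstract does not specify. All function names are hypothetical.

```python
from statistics import mean, pstdev


def moving_average(series, window=5):
    """Stand-in reconstructor (NOT an LSTM autoencoder): centered moving average."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(mean(series[lo:hi]))
    return out


def reconstruction_errors(series, reconstructed):
    """Squared error between each observation and its reconstruction."""
    return [(x - y) ** 2 for x, y in zip(series, reconstructed)]


def flag_anomalies(errors, k=3.0):
    """Indices whose error exceeds mean + k * std (an assumed thresholding rule)."""
    threshold = mean(errors) + k * pstdev(errors)
    return [i for i, e in enumerate(errors) if e > threshold]


# Toy series: flat signal with a single spike at index 10.
toy_series = [1.0] * 20
toy_series[10] = 10.0
toy_errors = reconstruction_errors(toy_series, moving_average(toy_series))
anomalies = flag_anomalies(toy_errors)
```

In a real pipeline the reconstruction would come from the trained model, and the threshold would typically be calibrated on data assumed to be normal.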

Object Detection Using Sim2Real Domain Randomization for Robotic Applications

Dániel Horváth , Gábor Erdős , Zoltán Istenes , Tomáš Horváth , Sándor Földi

Categories: Robotics | Computer Vision

2022-08-08

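Robots working in unstructured environments must be able to sense and interpret their surroundings. One of the main obstacles to deep-learning-based models in robotics is the lack of domain-specific labeled data for different industrial applications. In this paper, we propose a sim2real transfer learning method based on domain randomization for object detection, with which labeled synthetic datasets of arbitrary size and object types can be generated automatically. Subsequently, a state-of-the-art convolutional neural network, YOLOv4, is trained to detect the different types of industrial objects. With the proposed domain randomization method, we could shrink the reality gap, achieving mAP50 scores of 86.32% and 97.38% in the zero-shot and one-shot transfer cases, respectively, on a dataset containing 190 real images. On a GeForce RTX 2080 Ti GPU, the data generation process takes less than 0.5 s per image and training lasts about 12 h, which makes the method convenient for industrial use. Our solution fits industrial needs because it can reliably differentiate similar classes of objects using only one real image for training. To the best of our knowledge, this is the only work so far that satisfies these constraints.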

Explaining Classifiers Trained on Raw Hierarchical Multiple-Instance Data

Tomáš Pevný , Viliam Lisý , Branislav Bošanský , Petr Somol , Michal Pěchouček

Categories: Machine Learning (Statistics) | Machine Learning

2022-08-04

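Learning from raw data input, which limits the need for feature engineering, is a component of many successful applications of machine learning methods in various domains. While many problems naturally translate into a vector representation directly usable in standard classifiers, a number of data sources have the natural form of structured data interchange formats (e.g., security logs in JSON/XML format). Existing methods, such as those in Hierarchical Multiple Instance Learning (HMIL), allow learning from such data in its raw form. However, the explanation of classifiers trained on raw structured data remains largely unexplored. By treating the explanation of these models as a subset selection problem, we demonstrate how interpretable explanations with favorable properties can be generated using computationally efficient algorithms. We compare against an explanation technique adopted from graph neural networks, showing an order-of-magnitude speed-up and higher-quality explanations.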

Teachers in concordance for pseudo-labeling of 3D sequential data

Awet Haileslassie Gebrehiwot , Patrik Vacek , David Hurych , Karel Zimmermann , Patrick Perez , Tomáš Svoboda

Categories: Computer Vision | Robotics

2022-07-13

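Automatic pseudo-labeling is a powerful tool for tapping into large amounts of sequential unlabeled data. It is especially appealing in safety-critical applications of autonomous driving, where performance requirements are extreme, datasets are large, and manual labeling is very challenging. We propose to leverage the sequentiality of the captures to boost the pseudo-labeling technique in a teacher-student setup by training multiple teachers, each with access to different temporal information. This set of teachers, dubbed Concordance, provides higher-quality pseudo-labels for student training than standard methods. The outputs of the multiple teachers are combined via a novel pseudo-label confidence-guided criterion. Our experimental evaluation focuses on the 3D point cloud domain in urban driving scenarios. We show the performance of our method applied to multiple model architectures on the tasks of 3D semantic segmentation and 3D object detection on two benchmark datasets. Our method, which uses only 20% of the manual labels, outperforms some fully supervised methods. A notable performance boost is achieved for classes that rarely appear in the training data, such as bicycles and pedestrians. The implementation of our approach is publicly available at https://github.com/ctu-vras/t-concord3d.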

Are Hitting Formulas Hard for Resolution?

Tomáš Peitl , Stefan Szeider

Categories: Artificial Intelligence

2022-06-30

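Hitting formulas, introduced by Iwama, are an unusual class of propositional CNF formulas. Not only can their satisfiability be decided in polynomial time, but their models can even be counted in closed form. This stands in stark contrast to other polynomial-time decidable classes, which usually have algorithms based on backtracking and resolution and for which model counting remains hard, such as 2-SAT and Horn-SAT. However, those resolution-based algorithms usually easily imply an upper bound on resolution complexity, which is missing for hitting formulas. Are hitting formulas hard for resolution? In this paper, we take the first steps toward answering this question. We show that the resolution complexity of hitting formulas is dominated by so-called irreducible hitting formulas, first studied by Kullmann and Zhao, which cannot be composed of smaller hitting formulas. However, by definition, large irreducible unsatisfiable hitting formulas are difficult to construct; it is not even known whether infinitely many exist. Building on our theoretical results, we implement an efficient algorithm on top of the Nauty software package to enumerate all irreducible unsatisfiable hitting formulas with up to 14 clauses. We also determine the exact resolution complexity of the generated hitting formulas by adapting a known SAT encoding to our purposes. Our experimental results suggest that hitting formulas are indeed hard for resolution.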
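The closed-form model count mentioned in the abstract follows from the definition: in a hitting formula every two clauses clash on some variable, so no assignment falsifies two clauses at once. A clause with k literals over n variables is falsified by exactly 2^(n-k) assignments, hence the number of models is 2^n minus the sum of 2^(n-|C|) over all clauses C. The sketch below illustrates this; the encoding (clauses as sets of nonzero integers, DIMACS style, with no tautological clauses) and the function names are assumptions for illustration, not the paper's code.

```python
from itertools import combinations, product


def is_hitting(clauses):
    """True if every pair of clauses contains a complementary pair of literals."""
    return all(any(-lit in b for lit in a) for a, b in combinations(clauses, 2))


def count_models_hitting(clauses, n_vars):
    """Closed-form model count; valid only for hitting formulas."""
    if not is_hitting(clauses):
        raise ValueError("not a hitting formula")
    # The sets of assignments falsifying different clauses are pairwise disjoint,
    # so their sizes simply add up.
    falsifying = sum(2 ** (n_vars - len(c)) for c in clauses)
    return 2 ** n_vars - falsifying


def count_models_brute_force(clauses, n_vars):
    """Reference count by enumerating all assignments (exponential time)."""
    count = 0
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            count += 1
    return count


# (x1 or x2) and (not x1 or x3) and (x1 or not x2 or x3)
example = [frozenset({1, 2}), frozenset({-1, 3}), frozenset({1, -2, 3})]
# All four clauses over two variables: an unsatisfiable hitting formula.
unsat = [frozenset({1, 2}), frozenset({1, -2}), frozenset({-1, 2}), frozenset({-1, -2})]
```

For `example`, the clauses rule out 2 + 2 + 1 = 5 of the 8 assignments, leaving 3 models, which the brute-force count confirms. A hitting formula is unsatisfiable exactly when the falsifying counts sum to 2^n, as in `unsat`.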
